1 Introduction

In the modern business world, collaborative data mining has become especially important because of the mutual benefit it brings to the collaborators. During the collaboration, each party needs to share its data with the other parties. If the parties do not care about their data privacy, the collaboration is easy to achieve. However, if the parties do not want to disclose their private data to each other, can they still collaborate?

To use existing data mining algorithms, all parties need to send their data to a trusted central place where the mining is conducted. However, in situations with privacy concerns, the parties may not trust anyone, including a third party. Generic solutions for any kind of secure collaborative computing exist in the literature (e.g., [7] and [3]). However, none of the proposed generic solutions is practical for large-scale data sets because of the prohibitive extra cost of protecting data privacy. Therefore, practical solutions need to be developed. This need underlies the rationale for our research.

Data mining includes a number of different tasks. This paper focuses on sequential pattern mining. Specifically, we study how multiple parties can jointly mine sequential patterns while preserving the data privacy of each party. To the best of our knowledge, the problem has not been investigated and is a challenge to the information security and privacy community. Our contributions are (1) to propose a new representation scheme for sequential data in order to facilitate mining sequential patterns among private data sets, and (2) to develop a new secure protocol for multiple parties to jointly compute the support measure of sequential patterns.

The paper is organized as follows: Section 2 discusses related work. We then formally define the problem of mining sequential patterns on private data in Section 3. In Section 4, we describe our secure protocols. We give our conclusion in Section 5.


2  Related  Work 

Privacy-Preserving Multi-Party Data Mining. In early work on privacy-preserving data mining, Lindell and Pinkas [4] proposed a solution to the privacy-preserving classification problem using the oblivious transfer protocol, a powerful tool developed in secure multi-party computation research. Vaidya and Clifton [5] proposed using the scalar product as the basic component to tackle association rule mining in vertically partitioned data. Later, they proposed a permutation scheme to solve K-means clustering [6] over vertically partitioned data. In [8], a secure procedure is provided for privacy-preserving collaborative data mining.

Sequential Pattern Mining. Sequential pattern mining, introduced in [1], is concerned with inducing rules from a set of sequences of ordered items. The method in [1] calculates the support measures of sequences by iteratively joining those sub-sequences whose supports exceed a given threshold. However, to the best of our knowledge, the issue of secure sequential pattern mining has not been studied. In this paper, we propose a scheme to tackle the problem of privacy-preserving sequential pattern mining over vertically partitioned data.

3  Privacy-Preserving  Collaborative  Sequential 
Pattern  Mining 

3.1  Background 

Since its introduction in 1995 [1], sequential pattern mining has received a great deal of attention. It is still one of the most popular pattern-discovery methods in the field of Knowledge Discovery. In sequential pattern mining, we are given a data set D of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction-time. We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not. An itemset is a non-empty set of items. A sequence is an ordered list of itemsets. A customer supports a sequence s if s is contained in the customer-sequence for this customer (the customer's transactions ordered by transaction-time). The support for a sequence is defined as the fraction of total customers who support this sequence. Given a data set D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern.
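As a minimal illustration of these definitions (not part of the method itself), the following Python sketch checks whether a customer supports a sequence and computes the support as a fraction of customers. The data layout (each customer-sequence as a list of frozensets), the names `contains`, `support` and `alice_sequences`, and the toy data are assumptions made for this example.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[int]


def contains(customer_seq: List[Itemset], pattern: List[Itemset]) -> bool:
    """True if the pattern's itemsets occur, in order, inside the customer-sequence.

    Each pattern itemset must be a subset of a distinct, later transaction.
    Greedy earliest matching is sufficient for this containment test.
    """
    pos = 0
    for itemset in customer_seq:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(customer_seqs: List[List[Itemset]], pattern: List[Itemset]) -> float:
    """Fraction of customers whose customer-sequence contains the pattern."""
    supporters = sum(1 for seq in customer_seqs if contains(seq, pattern))
    return supporters / len(customer_seqs)


# Hypothetical customer-sequences (one list per customer, ordered by time).
alice_sequences = [
    [frozenset({30})],
    [frozenset({10, 20}), frozenset({9, 15})],
    [frozenset({30}), frozenset({5, 10})],
]
single = support(alice_sequences, [frozenset({30})])
ordered = support(alice_sequences, [frozenset({30}), frozenset({10})])
```

Here two of the three customers support the one-itemset sequence containing item 30, while only the third customer supports item 30 followed later by item 10.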

3.2  Problem  Definition 

We consider the scenario where multiple parties, each having a private data set (denoted by $D_1, D_2, \ldots, D_n$, respectively), want to collaboratively conduct sequential pattern mining on the union of their data sets. Because they are concerned about data privacy, no party is willing to disclose its raw data set to the others. Without loss of generality, we make the following assumptions on the data sets (the assumptions can be achieved by pre-processing the data sets $D_1, D_2, \ldots, D_n$, and such pre-processing does not require one party to send its data set to other parties):

1. $D_1, D_2, \ldots, D_n$ are data sets, where each data set consists of the customer-id, transaction-time, and the items purchased in each transaction.

2. $D_1, D_2, \ldots, D_n$ contain different types of items (e.g., they come from different types of markets).

3. The identities of the transactions in $D_1, D_2, \ldots, D_n$ are the same.

4. The customer-ID and the customer's transaction time can be shared among the parties, but the items that a customer actually bought are confidential.

Privacy-Preserving Collaborative Sequential Pattern Mining problem: Party 1 has a private data set $D_1$, party 2 has a private data set $D_2$, ..., and party $n$ has a private data set $D_n$. The data set $[D_1 \cup D_2 \cup \cdots \cup D_n]$ is the union of $D_1, D_2, \ldots, D_n$ (formed by vertically putting $D_1, D_2, \ldots, D_n$ together). Let $N$ be a set of transactions, with $N_k$ representing the $k$th transaction. These $n$ parties want to conduct sequential pattern mining on $[D_1 \cup D_2 \cup \cdots \cup D_n]$ and to find the sequential patterns with support greater than the given threshold, but they do not want to share their private data sets with each other. We say that a sequential pattern $x_i < y_j$, where $x_i$ occurs before or at the same time as $y_j$, has support $s$ in $[D_1 \cup D_2 \cup \cdots \cup D_n]$ if $s\%$ of the transactions in $[D_1 \cup D_2 \cup \cdots \cup D_n]$ contain both $x_i$ and $y_j$ with $x_i$ happening before or at the same time as $y_j$ (namely, $s\% = P(x_i \cap y_j \mid x_i < y_j)$).


3.3  Sequential  Pattern  Mining  Procedure 

The  procedure  of  mining  sequential  patterns  contains  the  following  steps: 

Step  I:  Sorting 

The data set $[D_1 \cup D_2 \cup \cdots \cup D_n]$ is sorted, with customer-id as the major key and transaction-time as the minor key. This step implicitly converts the original transaction data set into a data set of customer sequences. Since the customer-id and the transaction time are not private among the parties, this step can be executed without using a secure protocol. As a result, the transactions of a customer may appear in more than one row, where a row contains a customer ID, a particular transaction time, and the items bought at this transaction time. For example, suppose that the data sets, after being sorted by customer-id, are as shown in Fig. 1. Then, after being sorted by transaction time, the data tables of Fig. 1 become those shown in Fig. 2.
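The sorting step can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout `(customer_id, transaction_time, items)`, the function name `to_customer_sequences`, and the reading of the two-digit years as 2003 are choices made for this example.

```python
from datetime import date
from itertools import groupby

# One party's rows, already sorted by customer-id only:
# (customer-id, transaction-time, items bought).
carol_rows = [
    (1, date(2003, 6, 28), frozenset({110})),
    (2, date(2003, 6, 13), frozenset({107})),
    (3, date(2003, 6, 19), frozenset({103})),
    (3, date(2003, 6, 26), frozenset({105, 106})),
    (3, date(2003, 6, 21), frozenset({101, 102})),
]


def to_customer_sequences(rows):
    """Sort by (customer-id, transaction-time) and group rows per customer."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    return {
        customer: [items for _, _, items in group]
        for customer, group in groupby(ordered, key=lambda r: r[0])
    }


sequences = to_customer_sequences(carol_rows)
```

Only the customer-id and the transaction time are used as sort keys, which is why the parties can run this step without a secure protocol.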


Alice                           Bob                             Carol
C-ID  T-time    Items Bought    C-ID  T-time    Items Bought    C-ID  T-time    Items Bought
1     06/25/03  30              1     06/30/03  90              1     06/28/03  110
2     06/10/03  10, 20          2     06/15/03  40, 60          2     06/13/03  107
2     06/20/03  9, 15           3     06/18/03  35, 50          3     06/19/03  103
3     06/25/03  30              3     06/10/03  45, 70          3     06/26/03  105, 106
3     06/30/03  5, 10                                           3     06/21/03  101, 102

Figure 1. Raw Data Sorted By Customer ID

Alice                           Bob                             Carol
C-ID  T-time    Items Bought    C-ID  T-time    Items Bought    C-ID  T-time    Items Bought
1     06/25/03  30              N/A                             N/A
N/A                             N/A                             1     06/28/03  110
N/A                             1     06/30/03  90              N/A
2     06/10/03  10, 20          N/A                             N/A
N/A                             N/A                             2     06/13/03  107
N/A                             2     06/15/03  40, 60          N/A
2     06/20/03  9, 15           N/A                             N/A
N/A                             3     06/10/03  45, 70          N/A
N/A                             3     06/18/03  35, 50          N/A
N/A                             N/A                             3     06/19/03  103
N/A                             N/A                             3     06/21/03  101, 102
3     06/25/03  30              N/A                             N/A
N/A                             N/A                             3     06/26/03  105, 106
3     06/30/03  5, 10           N/A                             N/A

N/A: The information is not available.

Figure 2. Raw Data Sorted By Customer ID and Transaction Time

Step II: Mapping

Each item of a row is considered as an attribute. We map each item of a row (i.e., an attribute) to an integer in increasing order and repeat for all rows. A re-occurrence of an item is mapped to the same integer. As a result, each item becomes an attribute and all attributes are binary-valued. For instance, the sequence < B, (A, C) >, indicating that the transaction B occurs prior to the transaction (A, C) with A and C being simultaneous events, is mapped to integers in the order B -> 1, A -> 2, C -> 3, (A, C) -> 4. During the mapping, the corresponding transaction time is kept. For instance, based on the sorted data set of Fig. 2, we may construct the mapping table shown in Fig. 3. We use "-" to denote the mapping in Fig. 3. For example, "30 - 1A" means that we map 30 to 1A. After the mapping, the mapped data sets are shown in Fig. 4.
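The mapping rule can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `build_mapping`, the tuple-based keys, and the convention of appending the party letter to the integer are assumptions chosen so that the output matches Alice's entries in Fig. 3.

```python
def build_mapping(sorted_rows, party):
    """Assign mapped-IDs in increasing order; a re-occurring key keeps its ID.

    sorted_rows: item tuples per row, in sorted (customer-id, time) order.
    party: letter appended to each integer, e.g. "A" for Alice (assumed convention).
    """
    mapping = {}

    def assign(key):
        if key not in mapping:
            mapping[key] = f"{len(mapping) + 1}{party}"

    for items in sorted_rows:
        for item in items:          # every single item is an attribute
            assign((item,))
        if len(items) > 1:          # the itemset as a whole is one more attribute
            assign(tuple(items))
    return mapping


# Alice's rows in the order of Fig. 2.
alice_rows = [(30,), (10, 20), (9, 15), (30,), (5, 10)]
alice_mapping = build_mapping(alice_rows, "A")
```

With this input, item 30 receives the ID 1A once, item 10 keeps 2A when it reappears in the last row, and the itemset (5, 10) becomes 9A.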


Alice              Bob                Carol
30 - 1A            90 - 1B            110 - 1C
10 - 2A            40 - 2B            107 - 2C
20 - 3A            60 - 3B            103 - 3C
(10, 20) - 4A      (40, 60) - 4B      101 - 4C
9 - 5A             35 - 5B            102 - 5C
15 - 6A            50 - 6B            (101, 102) - 6C
(9, 15) - 7A       (35, 50) - 7B      105 - 7C
5 - 8A             45 - 8B            106 - 8C
(5, 10) - 9A       70 - 9B            (105, 106) - 9C
                   (45, 70) - 10B

Note that, in Alice's data set, items 30 and 10 occur more than once, so each occurrence is mapped to the same mapped-ID.

Figure 3. Mapping Table


Figure 4. Data After Being Mapped

Step  III:  Mining 

Our mining procedure is based on the mapped data set produced by the mapping step. The general sequential pattern mining procedure makes multiple passes over the data. In each pass, we start with a seed set of large sequences, where a large sequence refers to a sequence whose itemsets all satisfy the minimum support. We use the seed set to generate new potentially large sequences, called candidate sequences. We find the support for these candidate sequences during the pass over the data. At the end of each pass, we determine which of the candidate sequences are actually large. These large candidates become the seed for the next pass. The following is the procedure for mining sequential patterns on $[D_1 \cup D_2 \cup \cdots \cup D_n]$.

1. $L_1$ = large 1-sequences

2. for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++) do {

3.     $C_k$ = apriori-generate($L_{k-1}$)

4.     for all candidates $c \in C_k$ do {

5.         Compute c.count

           (Section 3.4 will show how to compute this count on private data)

6.         $L_k = L_k \cup \{c \mid c.count > minsup\}$

7.     }

8. }

9. Return $\bigcup_k L_k$

where $L_k$ stands for the set of large sequences with $k$ itemsets and $C_k$ stands for the collection of candidate $k$-sequences. The procedure apriori-generate is described as follows:

Step 1: join $L_{k-1}$ with $L_{k-1}$:

1. insert into $C_k$

2. select $p.litemset_1, \ldots, p.litemset_{k-1}, q.litemset_{k-1}$, where $p.litemset_1 = q.litemset_1, \ldots, p.litemset_{k-2} = q.litemset_{k-2}$

3. from $L_{k-1}$ $p$, $L_{k-1}$ $q$.

Step 2: delete all sequences $c \in C_k$ such that some $(k-1)$-subsequence of $c$ is not in $L_{k-1}$.
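The pass structure and the apriori-generate procedure above can be sketched in Python as follows. The sketch makes several assumptions: sequences are tuples of mapped integer IDs, the support count is supplied as a plain callback `count` (standing in for the secure computation of Section 3.4), and the names `apriori_generate` and `mine_large_sequences` are chosen for this example. The join follows the stated condition literally and therefore also pairs a sequence with itself.

```python
from typing import Callable, Set, Tuple

Seq = Tuple[int, ...]


def apriori_generate(prev_large: Set[Seq]) -> Set[Seq]:
    """Generate candidate k-sequences from the large (k-1)-sequences."""
    # Step 1 (join): p and q agree on their first k-2 itemsets.
    joined = {
        p + (q[-1],)
        for p in prev_large
        for q in prev_large
        if p[:-1] == q[:-1]
    }

    # Step 2 (prune): every (k-1)-subsequence must itself be large.
    def drop_one(c: Seq):
        return (c[:i] + c[i + 1:] for i in range(len(c)))

    return {c for c in joined if all(s in prev_large for s in drop_one(c))}


def mine_large_sequences(large_1: Set[Seq],
                         count: Callable[[Seq], int],
                         minsup: int) -> Set[Seq]:
    """Repeat passes until no large sequences remain; return the union of all L_k."""
    all_large = set(large_1)
    prev = set(large_1)
    while prev:
        candidates = apriori_generate(prev)
        prev = {c for c in candidates if count(c) > minsup}
        all_large |= prev
    return all_large


candidates_3 = apriori_generate({(1, 2), (1, 3), (2, 3)})
```

For the three large 2-sequences in the last line, the join produces five candidates and the prune step keeps only `(1, 2, 3)`, because every other candidate has a 2-subsequence that is not large.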

Step  IV:  Maximization 

Having  found  the  set  of  all  large  sequences  S  in  the  sequence  phase,  we  provide 
the  following  procedure  to  find  the  maximal  sequences. 

1. for ($k = m$; $k > 1$; $k$--) do

2.     for each $k$-sequence $s_k$ do

3.         Delete all subsequences of $s_k$ from $S$
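A minimal sketch of the maximization step is given below. It assumes that sequences are tuples of mapped integer IDs, that $m$ is the length of the longest large sequence, and that a subsequence is obtained by deleting elements without reordering; containment between an item and an itemset that includes it is not modeled. The names `is_subsequence` and `maximal_sequences` are hypothetical.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(x in it for x in short)


def maximal_sequences(large):
    """Remove every large sequence that is a subsequence of a longer one."""
    remaining = set(large)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 1, -1):          # k = m, m-1, ..., 2
        for s_k in [s for s in remaining if len(s) == k]:
            remaining -= {
                s for s in remaining
                if s != s_k and is_subsequence(s, s_k)
            }
    return remaining


large = {(1,), (2,), (3,), (4,), (1, 2), (1, 3), (2, 3), (1, 2, 3)}
maximal = maximal_sequences(large)
```

In this toy input, everything contained in `(1, 2, 3)` is deleted, and `(4,)` survives because no longer large sequence contains it.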


Step  V:  Converting 

The items in the final large sequences are converted back to the original item representation used before the mapping step. For example, if 1A belongs to some large sequential pattern, then 1A is converted to item 30, according to the mapping table, in the final large sequential patterns.

3.4 How to compute c.count

To compute c.count, in other words, to compute the support for some candidate pattern (e.g., $P(x_i \cap y_i \cap z_i \mid x_i > y_i > z_i)$), we need to consider two aspects: one is to deal with the condition part, where $z_i$ occurs before $y_i$ and both of them occur before $x_i$; the other is to compute the actual counts for this sequential pattern.

If all the candidates belong to just one party, then c.count, which refers to the frequency counts for candidates, can be computed by this party alone since this party has all the information. However, if the candidates belong to different parties, it is a non-trivial problem to conduct the joint frequency counts while protecting the privacy of the data. We provide the following steps to conduct this cross-party computation.

Step  I:  Vector  construction 

The parties construct vectors for their own attributes (mapped-IDs). Suppose we want to compute the c.count for 2A > 2B > 6C in Fig. 4. We construct three vectors: 2A, 2B and 6C, as in Fig. 5.


Figure 5. A Protocol To Compute c.count

Step II: Transaction time comparison

1. All the parties randomly generate a set of reasonable¹ transaction times for the entries in the vector whose values are 0. The purpose of this step is to prevent one party from correctly guessing other parties' data.

2. Randomly select one party who will receive the transaction time vectors from the other two parties. For instance, in our example, Carol is selected to receive the transaction time vectors for 2A and 2B from Alice and Bob.

3. Carol then compares the transaction times of each entry of 2A, 2B, and 6C. She makes a temporary vector T. If the transaction times do not satisfy the requirement 2A > 2B > 6C, she sets the corresponding entries of T to 0; otherwise, she copies the original values in 6C to T (Fig. 5).

¹ Transaction times in the parties' data sets are usually distributed over some periods. By reasonable, we mean that the randomly generated times should fall into these periods. If a party fails to ensure this, other parties can easily make a correct guess of its entry values.

In the data sets, if the item value is 0, then there is no transaction time associated with it. Therefore, if one party (e.g., Alice) sends her actual transaction time vector to other parties, they immediately learn the values of those entries whose transaction times are absent, which is an information leak. To enhance data privacy, instead of sending the actual transaction time vector, Alice generates random transaction times for the entries whose values are 0. In other words, Alice adds random noise to the transaction time vector. By doing so, other parties (e.g., Bob) cannot directly tell which transactions did not occur. If Bob takes a random guess for each entry value based on the received transaction time vector, he guesses correctly with probability 0.5, provided that he has no additional information. This random transaction time generation therefore enhances data privacy.

Step  III:  Compute  c.  count 

After they compare their transaction times for the candidate vectors (e.g., 2A, 2B and T), they apply the secure protocols, which will be discussed in Section 4, to the value vectors to obtain c.count. For example, to obtain c.count for 2A, 2B, and 6C in Fig. 5, they need to compute $\sum_{i=1}^{N} \mathrm{2A}[i] \cdot \mathrm{2B}[i] \cdot T[i] = 0$, where $N$ is the total number of values in each vector.
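The three steps can be sketched in the clear (without the secure protocols of Section 4) to show what is being computed. Everything concrete here is an assumption made for illustration: the vectors, the dates, the period bounds, the names `mask_times`, `build_T` and `c_count`, and the reading of "2A > 2B > 6C" as "6C occurs first, then 2B, then 2A".

```python
import random
from datetime import date, timedelta


def mask_times(values, times, start, end, rng):
    """Step II.1: give entries whose value is 0 a random time inside [start, end]."""
    span = (end - start).days
    return [
        t if v == 1 else start + timedelta(days=rng.randint(0, span))
        for v, t in zip(values, times)
    ]


def build_T(times_a, times_b, times_c, values_c):
    """Step II.3: copy 6C's value where the time ordering holds, else 0."""
    # Assumed reading of "2A > 2B > 6C": 6C happens before 2B, 2B before 2A.
    return [
        v if ta > tb > tc else 0
        for ta, tb, tc, v in zip(times_a, times_b, times_c, values_c)
    ]


def c_count(values_a, values_b, t):
    """Step III in the clear: sum of the element-wise products."""
    return sum(a * b * c for a, b, c in zip(values_a, values_b, t))


rng = random.Random(0)
start, end = date(2003, 6, 1), date(2003, 6, 30)
val_a, val_b, val_c = [1, 0, 1], [1, 1, 0], [1, 1, 1]
time_a = mask_times(val_a, [date(2003, 6, 20), None, date(2003, 6, 25)], start, end, rng)
time_b = mask_times(val_b, [date(2003, 6, 15), date(2003, 6, 15), None], start, end, rng)
time_c = [date(2003, 6, 10), date(2003, 6, 13), date(2003, 6, 26)]
T = build_T(time_a, time_b, time_c, val_c)
count = c_count(val_a, val_b, T)
```

The random times never change the count: a masked entry has value 0 in its own value vector, so its product term is 0 whatever Carol writes into T at that position.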

4  Secure  Protocols 

How  two  or  multiple  parties  jointly  compute  c.  count  without  revealing  their  raw  data 
to  each  other  is  the  challenge  that  we  want  to  address.  We  propose  two  protocols  to 
tackle  this  challenge:  One  is  for  two  parties  to  conduct  the  multiplication  operation; 
the  other,  with  the  first  protocol  as  the  basis,  is  designed  for  the  secure  multi-party 
product  operation. 

4.1  Introducing  The  Commodity  Server 

For performance reasons, we use an extra server, the commodity server [2], in our protocol. The parties can send requests to the commodity server and receive data (called commodities) from the server, but the commodities must be independent of the parties' private data. The purpose of the commodities is to help the parties conduct the desired computations.

The commodity server is semi-trusted in the following senses: (1) It is not trusted: it cannot derive the private information of the parties' data, and it cannot learn the computation result. (2) It will not collude with all the parties. (3) It follows the protocol correctly. Because of these characteristics, we say that it is a semi-trusted party. In the real world, finding such a semi-trusted party is much easier than finding a trusted party.

4.2  Component  Protocols 

Let us first consider the case of two parties, where $n = 2$ (more general cases where $n \ge 3$ will be discussed later). Alice has a vector $X$ and Bob has a vector $Y$. Both vectors have $N$ elements. Alice and Bob want to compute the product between $X$ and $Y$ such that Alice gets $\sum_{i=1}^{N} U_x[i]$ and Bob gets $\sum_{i=1}^{N} U_y[i]$, where $\sum_{i=1}^{N} (U_x[i] + U_y[i]) = \sum_{i=1}^{N} X[i] \cdot Y[i] = X \cdot Y$. $U_y[i]$ and $U_x[i]$ are random numbers. Namely, the scalar product of $X$ and $Y$ is divided into two secret pieces, with one piece going to Alice and the other going to Bob. We assume that random numbers are generated from the integer domain.

Protocol  1.  (Secure  Two-party  Protocol) 

1. The commodity server generates two random numbers $R_x[1]$ and $R_y[1]$, and lets $r_x[1] + r_y[1] = R_x[1] \cdot R_y[1]$, where $r_x[1]$ (or $r_y[1]$) is a randomly generated number. Then the server sends $(R_x[1], r_x[1])$ to Alice, and $(R_y[1], r_y[1])$ to Bob.

2. Alice sends $\hat{X}[1] = X[1] + R_x[1]$ to Bob.

3. Bob sends $\hat{Y}[1] = Y[1] + R_y[1]$ to Alice.

4. Bob generates a random number $U_y[1]$, and computes $\hat{X}[1] \cdot Y[1] + (r_y[1] - U_y[1])$, then sends the result to Alice.

5. Alice computes $(\hat{X}[1] \cdot Y[1] + (r_y[1] - U_y[1])) - (R_x[1] \cdot \hat{Y}[1]) + r_x[1] = X[1] \cdot Y[1] - U_y[1] + (r_y[1] - R_x[1] \cdot R_y[1] + r_x[1]) = X[1] \cdot Y[1] - U_y[1] = U_x[1]$.

6. Repeat steps 1-5 to compute $X[i] \cdot Y[i]$ for $i \in [2, N]$. Alice then gets $U_x[i]$ and Bob gets $U_y[i]$.
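The message flow of Protocol 1 can be simulated in a single process to check the arithmetic. This is a sketch under assumptions, not a secure implementation: all three roles run in one function, Python's `random` module stands in for a proper source of randomness, and the names `commodity_server`, `secure_product`, `secure_scalar_product` and the `bound` parameter are hypothetical.

```python
import random


def commodity_server(rng, bound=10**6):
    """Step 1: produce (R_x, r_x) for Alice and (R_y, r_y) for Bob."""
    R_x, R_y = rng.randrange(bound), rng.randrange(bound)
    r_x = rng.randrange(bound)
    r_y = R_x * R_y - r_x            # so that r_x + r_y == R_x * R_y
    return (R_x, r_x), (R_y, r_y)


def secure_product(x, y, rng, bound=10**6):
    """Return shares (U_x, U_y) with U_x + U_y == x * y."""
    (R_x, r_x), (R_y, r_y) = commodity_server(rng, bound)
    x_hat = x + R_x                  # step 2: Alice -> Bob
    y_hat = y + R_y                  # step 3: Bob -> Alice
    U_y = rng.randrange(bound)       # step 4: Bob's random share
    message = x_hat * y + (r_y - U_y)
    U_x = message - R_x * y_hat + r_x    # step 5: Alice's share
    return U_x, U_y


def secure_scalar_product(X, Y, rng):
    """Step 6: repeat per element and sum each party's shares."""
    shares = [secure_product(x, y, rng) for x, y in zip(X, Y)]
    return sum(u for u, _ in shares), sum(v for _, v in shares)


alice_share, bob_share = secure_scalar_product([1, 0, 1, 1], [1, 1, 0, 1],
                                               random.Random(7))
```

Neither share reveals the result on its own; only the sum `alice_share + bob_share` equals the scalar product of the two vectors.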


Theorem 1. Protocol 1 is secure such that Alice cannot learn $Y$ and Bob cannot learn $X$ either.

Proof. The number $\hat{X}[i] = X[i] + R_x[i]$ is all that Bob gets. Because of the randomness and the secrecy of $R_x[i]$, Bob cannot find out $X[i]$. According to the protocol, Alice gets (1) $\hat{Y}[i] = Y[i] + R_y[i]$, (2) $Z[i] = \hat{X}[i] \cdot Y[i] + (r_y[i] - U_y[i])$, and (3) $r_x[i]$, $R_x[i]$, where $r_x[i] + r_y[i] = R_x[i] \cdot R_y[i]$. We will show that for any arbitrary $Y'[i]$, there exist $r'_y[i]$, $R'_y[i]$ and $U'_y[i]$ that satisfy the above equations. Assume $Y'[i]$ is an arbitrary number. Let $R'_y[i] = \hat{Y}[i] - Y'[i]$, $r'_y[i] = R_x[i] \cdot R'_y[i] - r_x[i]$, and $U'_y[i] = \hat{X}[i] \cdot Y'[i] + r'_y[i] - Z[i]$. Therefore, Alice has (1) $\hat{Y}[i] = Y'[i] + R'_y[i]$, (2) $Z[i] = \hat{X}[i] \cdot Y'[i] + (r'_y[i] - U'_y[i])$ and (3) $r_x[i]$, $R_x[i]$, where $r_x[i] + r'_y[i] = R_x[i] \cdot R'_y[i]$. Thus, from what Alice learns, there exist infinitely many possible values for $Y[i]$. Therefore, Alice cannot know $Y$ and neither can Bob know $X$. □
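The construction used in the proof can be checked numerically. The sketch below assumes that Alice's view of one element is the tuple $(X[i], R_x[i], r_x[i], \hat{Y}[i], Z[i])$ and that $U'_y[i]$ is chosen so that equation (2) holds exactly; the function name `consistent_alternative` and the sample numbers are hypothetical.

```python
def consistent_alternative(view, y_alt):
    """For any candidate y_alt, return (R'_y, r'_y, U'_y) matching Alice's view."""
    x, R_x, r_x, y_hat, z = view
    x_hat = x + R_x
    R_y_alt = y_hat - y_alt                  # keeps equation (1)
    r_y_alt = R_x * R_y_alt - r_x            # keeps r_x + r'_y = R_x * R'_y
    U_y_alt = x_hat * y_alt + r_y_alt - z    # keeps equation (2)
    return R_y_alt, r_y_alt, U_y_alt


# A real run with x = 1, y = 1 and hypothetical random values.
x, y, R_x, R_y, r_x, U_y = 1, 1, 7, 11, 5, 13
r_y = R_x * R_y - r_x
view = (x, R_x, r_x, y + R_y, (x + R_x) * y + (r_y - U_y))

explanations = {y_alt: consistent_alternative(view, y_alt) for y_alt in (0, 1, 42)}
```

Every candidate value of $Y[i]$, including the wrong ones 0 and 42, comes with random values that reproduce exactly what Alice saw, which is the sense in which her view does not determine $Y[i]$.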

We  have  discussed  our  protocol  of  secure  number  product  for  two  parties. 
Next,  we  will  consider  the  protocol  for  securely  computing  the  number  product  for 
multiple  parties.  For  simplicity,  we  only  describe  the  protocol  when  n  =  3.  The 
protocols  for  the  cases  when  n  >  3  can  be  similarly  derived. 

Protocol  2.  (Secure  Multi-party  Protocol) 

Step  I 

1. Alice generates a random number $R_x[1]$.

2. Bob generates two random numbers $R_y[1]$ and $R'_y[1]$.

3. Carol generates a random number $R_z[1]$.

Step  II 

1. Carol computes $Z[1] + R_z[1]$ and sends it to Bob.

2. Bob computes $Y[1] + R_y[1]$ and sends it to Alice.

3. Alice computes $T[1] = (X[1] + R_x[1]) * (Y[1] + R_y[1]) * (Z[1] + R_z[1])$.

4. Alice and Bob use Protocol 1 to compute $X[1] \cdot R_y[1]$, $R_x[1] \cdot Y[1]$, $R_x[1] \cdot R_y[1]$, $X[1] \cdot Y[1]$ and $R_x[1] \cdot R'_y[1]$. Then Alice obtains $U_x[1]$, $U_x[2]$, $U_x[3]$, $U_x[4]$ and $U_x[5]$, and Bob obtains $U_y[1]$, $U_y[2]$, $U_y[3]$, $U_y[4]$ and $U_y[5]$.

5. Alice sends $(U_x[1] + U_x[2] + U_x[3])$ and $(U_x[4] + U_x[1] + U_x[2] + U_x[5])$ to Carol.

6. Bob sends $(U_y[1] + U_y[2] + U_y[3])$ and $(U_y[4] + U_y[1] + U_y[2] + U_y[5])$ to Carol.

7. Carol computes

$T_1[1] = (X[1] \cdot R_y[1] + R_x[1] \cdot Y[1] + R_x[1] \cdot R_y[1]) \cdot Z[1]$, and

$T_2[1] = (X[1] \cdot Y[1] + X[1] \cdot R_y[1] + R_x[1] \cdot Y[1] + R_x[1] \cdot R'_y[1]) \cdot R_z[1]$.

8. Alice and Carol use Protocol 1 to compute $R_x[1] \cdot R_z[1]$ and send the values they obtained from the protocol to Bob.

9. Bob computes $T_3[1] = R''_y[1] \cdot R_x[1] \cdot R_z[1]$.

Step  III 

1. Repeat Step I and Step II to compute $T[i]$, $T_1[i]$, $T_2[i]$ and $T_3[i]$ for $i \in [2, N]$.

2. Alice then gets

$[a] = \sum_{i=1}^{N} T[i] = \sum_{i=1}^{N} (X[i] + R_x[i]) * (Y[i] + R_y[i]) * (Z[i] + R_z[i])$.

3. Bob gets

$[b] = \sum_{i=1}^{N} T_3[i] = \sum_{i=1}^{N} R''_y[i] \cdot R_x[i] \cdot R_z[i]$.

4. Carol gets

$[c] = \sum_{i=1}^{N} T_1[i] = \sum_{i=1}^{N} (X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y[i]) \cdot Z[i]$, and

$[d] = \sum_{i=1}^{N} T_2[i] = \sum_{i=1}^{N} (X[i] \cdot Y[i] + X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R'_y[i]) \cdot R_z[i]$.

Note that $\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i] = \sum_{i=1}^{N} T_0[i] = [a] - [b] - [c] - [d]$.
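The identity $[a] - [b] - [c] - [d] = \sum_i X[i] \cdot Y[i] \cdot Z[i]$ can be checked with a single-process simulation. The sketch rests on stated assumptions: `share_product` is only a stand-in that returns additive shares of a product (the quantity Protocol 1 delivers) rather than the message exchange itself, $R''_y[i]$ is taken to be $R_y[i] - R'_y[i]$ (the choice under which the identity holds), and all function and parameter names are hypothetical.

```python
import random


def share_product(u, v, rng, bound=10**6):
    """Stand-in for Protocol 1: shares (s_u, s_v) with s_u + s_v == u * v."""
    s_v = rng.randrange(bound)
    return u * v - s_v, s_v


def three_party_terms(x, y, z, rng, bound=10**6):
    """Compute T, T1, T2, T3 for one element, following Steps I and II."""
    R_x = rng.randrange(bound)                                    # Alice
    R_y, R_y_prime = rng.randrange(bound), rng.randrange(bound)   # Bob
    R_z = rng.randrange(bound)                                    # Carol
    T = (x + R_x) * (y + R_y) * (z + R_z)                         # Alice, step 3
    pairs = [(x, R_y), (R_x, y), (R_x, R_y), (x, y), (R_x, R_y_prime)]
    shares = [share_product(u, v, rng, bound) for u, v in pairs]  # step 4
    U_x = [s[0] for s in shares]
    U_y = [s[1] for s in shares]
    # Steps 5-6: Carol adds the two partial sums she receives from Alice and Bob.
    first = (U_x[0] + U_x[1] + U_x[2]) + (U_y[0] + U_y[1] + U_y[2])
    second = (U_x[3] + U_x[0] + U_x[1] + U_x[4]) + (U_y[3] + U_y[0] + U_y[1] + U_y[4])
    T1 = first * z                                                # step 7
    T2 = second * R_z
    s_a, s_c = share_product(R_x, R_z, rng, bound)                # step 8
    T3 = (R_y - R_y_prime) * (s_a + s_c)                          # step 9 (assumed R''_y)
    return T, T1, T2, T3


def three_party_count(X, Y, Z, rng):
    """Step III: combine the per-element terms as [a] - [b] - [c] - [d]."""
    terms = [three_party_terms(x, y, z, rng) for x, y, z in zip(X, Y, Z)]
    a = sum(t[0] for t in terms)
    c = sum(t[1] for t in terms)
    d = sum(t[2] for t in terms)
    b = sum(t[3] for t in terms)
    return a - b - c - d


result = three_party_count([1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 1], random.Random(1))
```

Expanding $T[i]$ shows why the subtraction works: $T[i] - T_1[i] - T_2[i]$ leaves $X[i] \cdot Y[i] \cdot Z[i] + R_x[i] \cdot R_z[i] \cdot (R_y[i] - R'_y[i])$, and the last term is exactly what $T_3[i]$ removes under the assumption above.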

Theorem 2. Protocol 2 is secure such that Alice cannot learn $Y$ and $Z$, Bob cannot learn $X$ and $Z$, and Carol cannot learn $X$ and $Y$.

Proof. According to the protocol, Alice obtains (1) $(Y[i] + R_y[i])$, and (2) $(Z[i] + R_z[i])$. Bob gets (1) $R_x[i] \cdot R_z[i]$. Carol gets (1) $(X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y[i])$ and (2) $(X[i] \cdot Y[i] + X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R'_y[i])$.

Since $R''_y[i]$ $(= R_y[i] - R'_y[i])$ and $R_z[i]$ are arbitrary random numbers, from what Alice learns, there exist infinitely many possible values for $Y[i]$ and $Z[i]$. From what Bob learns, there also exist infinitely many possible values for $Z[i]$. From what Carol learns, there still exist infinitely many possible values for $X[i]$ and $Y[i]$.

Therefore, Alice cannot learn $Y$ and $Z$, Bob cannot learn $X$ and $Z$, and Carol cannot learn $X$ and $Y$ either.

□

5  Conclusion 

In  this  paper,  we  considered  the  problem  of  mining  sequential  patterns  with  multiple 
data  sets.  In  order  to  effectively  compute  the  support  measure  of  a  sequence,  we 
proposed  a  mapping  scheme  which  converts  data  sequences  into  a  binary  matrix 
representation  for  frequency  count  of  patterns.  We  presented  and  analyzed  our 
proposed  secure  protocol  designed  for  multiple  parties  to  jointly  conduct  sequential 
pattern  mining.  Within  this  secure  protocol,  we  introduced  the  multiple-party 
number  product  computation  as  the  basic  building  block.  In  our  future  work,  we 
will  extend  our  method  to  deal  with  other  data  mining  algorithms. 